GPU Accelerated Address Translation for Graphics Virtualization

ABSTRACT

In accordance with embodiments disclosed herein, there are provided methods, systems, mechanisms, techniques, and apparatuses for implementing GPU (Graphics Processing Unit) accelerated address translation for graphics virtualization. In one embodiment, such a system includes a main memory having a plurality of machine physical addresses; a graphics processor unit having graphics memory therein; an address translation service integrated with the graphics processor unit; a hypervisor to manage one or more guest machines; wherein the hypervisor is to configure a lookup table within the graphics memory of the graphics processor unit; and further wherein the address translation service of the graphics processor unit is to translate a guest physical address for one of the one or more guest machines to a corresponding machine physical address within the main memory. Such a graphics processor unit may be implemented separate from a system, for example, embodied within a silicon integrated circuit.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present patent application is a continuation application claiming priority from U.S. patent application Ser. No. 13/995,448, whose §371(c) date is Aug. 21, 2014, and titled: “GPU Accelerated Address Translation for Graphics Virtualization”, which is a U.S. National Phase Application under 35 U.S.C. § 371 of International Application No. PCT/CN2011/084327, filed Dec. 21, 2011, and titled: “GPU Accelerated Address Translation for Graphics Virtualization”, both of which are incorporated herein by reference in their entirety.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

TECHNICAL FIELD

The subject matter described herein relates generally to the field of computing, and more particularly, to systems and methods for implementing GPU (Graphics Processing Unit) accelerated address translation for graphics virtualization.

BACKGROUND

The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to embodiments of the claimed subject matter.

A GPU, or graphics processing unit, provides special circuitry designed to rapidly manipulate and alter memory so as to accelerate the building of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. They are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms in which large blocks of data are processed in parallel.

Virtualization provides the capability for multiple operating systems to simultaneously share processor resources in a secure and efficient manner. When implementing virtualization, it is necessary to translate a guest physical address of a virtualized guest machine (e.g., a virtual machine or “VM”) into the corresponding machine physical address of the underlying physical hardware upon which the resources for the guest machine are virtualized.

For example, when an operating system (OS) is running inside a virtual machine, the operating system does not usually know the machine physical addresses of the memory that it accesses. Direct access to the computer hardware is therefore complicated: if the guest operating system instructs the underlying hardware to perform a direct memory access (DMA) using the virtual machine's guest physical addresses, the access would likely corrupt memory, because the underlying hardware is unaware of the mapping between the guest physical addresses and the machine physical addresses for the virtual machine. A hypervisor managing the virtual machine can prevent such corruption, but the problem of address translation nevertheless remains.

An input/output memory management unit (IOMMU) can solve the translation problem by re-mapping the addresses accessed by the underlying hardware according to a translation table that maps guest physical addresses to machine physical addresses. IOMMU technology such as VT-d, or “Virtualization Technology for Directed I/O”, can be leveraged to provide the necessary translation capability on behalf of the virtual machine and the hypervisor when the requisite circuitry and chipset are available. VT-d is a type of IOMMU that may be included with some chipsets accompanying a CPU.

Unfortunately, not all chipsets include IOMMU or VT-d technology. For example, some Atom-based platforms, tablets, handheld smartphones, and notebook computers lack the circuitry necessary to provide a conventional VT-d capability.

Device drivers within virtual machines do not function properly without DMA address translation. Software solutions that perform address translation for DMA operations, for example within a hypervisor, have been attempted. However, the performance of software-based address translation is very poor. For example, 3D performance has been measured at approximately 40% of that of a native VT-d type solution. Worse yet, when software within a hypervisor was used to perform the address translation, it was measured to contribute about 90% of the total overhead.

Such an inefficient use of resources is unacceptable in today's mobile computing devices, which favor energy efficiency over pure computational horsepower. A more efficient solution is therefore necessary.

The present state of the art may therefore benefit from systems and methods for implementing GPU (Graphics Processing Unit) accelerated address translation for graphics virtualization as described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments are illustrated by way of example, and not by way of limitation, and will be more fully understood with reference to the following detailed description when considered in connection with the figures in which:

FIG. 1A illustrates an exemplary architecture of a graphics processor unit (GPU) in accordance with which embodiments may operate;

FIG. 1B illustrates an alternative exemplary architecture in which embodiments may operate;

FIG. 2A depicts interactions between a hypervisor and a GPU which implements address translation services in accordance with the disclosed embodiments;

FIG. 2B depicts logical memory address to guest physical address translation 201 in accordance with the disclosed embodiments;

FIG. 2C depicts Guest physical Frame Number (GFN) to Machine Frame Number (MFN) mapping in accordance with the disclosed embodiments;

FIG. 3 illustrates an exemplary system into which a graphics processor unit which implements address translation services may be integrated, installed, or configured, in accordance with one embodiment;

FIG. 4 is a flow diagram illustrating a method for implementing GPU (Graphics Processing Unit) accelerated address translation for graphics virtualization in accordance with described embodiments;

FIG. 5 illustrates a diagrammatic representation of a machine in the exemplary form of a computer system, in accordance with one embodiment;

FIG. 6 depicts a block diagram of a system in accordance with one embodiment;

FIG. 7 is a block diagram of a computer system in accordance with one embodiment;

FIG. 8 is a block diagram of a computer system in accordance with one embodiment;

FIG. 9 depicts a tablet computing device and a hand-held smartphone, each having circuitry integrated therein as described in accordance with the embodiments;

FIG. 10 is a block diagram of an embodiment of a tablet computing device, a smartphone, or other mobile device in which touchscreen interface connectors are used;

FIG. 11 is a block diagram of an IP core development system in accordance with one embodiment;

FIG. 12 illustrates an architecture emulation system in accordance with one embodiment; and

FIG. 13 illustrates a system to translate instructions in accordance with one embodiment.

DETAILED DESCRIPTION

Described herein are systems and methods for implementing GPU (Graphics Processing Unit) accelerated address translation for graphics virtualization. In one embodiment, such a system includes a main memory having a plurality of machine physical addresses; a graphics processor unit having graphics memory therein; an address translation service integrated with the graphics processor unit; a hypervisor to manage one or more guest machines; wherein the hypervisor is to configure a lookup table within the graphics memory of the graphics processor unit; and further wherein the address translation service of the graphics processor unit is to translate a guest physical address for one of the one or more guest machines to a corresponding machine physical address within the main memory. Such a graphics processor unit may be implemented separate from a system, for example, embodied within a silicon integrated circuit.

Practice of the disclosed embodiments provides an efficient and capable hardware-based mechanism to implement the necessary address translation when an IOMMU capability, such as VT-d, is not available. Moreover, the disclosed embodiments may be practiced even where an IOMMU capability such as VT-d is available, as doing so may yield improved results.

For example, in order to vastly reduce the address translation overhead associated with a software-centric solution, and potentially to reduce the overhead involved with an IOMMU chipset/VT-d based solution, the disclosed embodiments teach a GPU-based address translation service (ATS) that offloads the address translation overhead from another entity directly onto the GPU.

Improved graphics performance may be realized in virtual machines through practice of the disclosed embodiments, in which GPU-based address translation is utilized for graphics pass-through on a platform that is not VT-d capable. Such GPU-based address translation services support PCI device pass-through, such as DMA address translation of GFN to MFN addresses (e.g., guest physical addresses to machine physical addresses). Graphics sharing among different virtual machines may additionally be implemented utilizing the disclosed embodiments.

Using an extension, the GPU is enabled to perform address translation in a virtualization mode, allowing the GPU's hardware to perform the translation without relying on inefficient software implementations and without requiring a dedicated IOMMU in the CPU, such as VT-d, or other special support from the CPU or chipset. For example, the VMM may program registers to tell the GPU to use a pass-through mode, and the VMM then writes an address translation table into the GPU via the GPU's shared memory.

As client virtualization becomes more commonplace, the performance of virtualized machines will remain a key challenge. Although VT-d has aided performance on systems embodying the necessary CPU and chipset circuitry, newer mobile computing devices such as tablets, netbooks, and handheld smartphones very often do not incorporate VT-d capabilities, yet nevertheless make use of virtual machines. VT-d capabilities are often dropped from designs to meet the demanding constraints of small form factor mobile devices, and additionally because of the emphasis placed upon energy efficiency, which translates directly into extended battery life.

Practice of the disclosed embodiments may therefore enable efficient small form-factor devices such as tablets and smartphones to nevertheless support GPU-based address translation services, providing a high-performance solution in the absence of VT-d capabilities and without resorting to inefficient software solutions.

Further still, practice of the disclosed embodiments requires no special support or modification to the graphics driver stack and may achieve similar or better performance when compared with VT-d solutions.

The following definitions are provided for acronyms used throughout the disclosure that follows:

GFN: Guest physical Frame Number. For example, addresses that the guest or virtual machine treats as hardware addresses and uses in guest page tables.

MFN: Machine Frame Number. For example, the actual hardware addresses in the underlying hardware.

VM: Virtual Machine or Guest Machine.

VMM: A Virtual Machine Monitor or Hypervisor for VMs or Guest Machines.

GTT: Graphics Translation Table for virtual memory.

GGTT: Global Graphics Translation Table. For example, a single common translation table used for all processes.

PPGTT: Per-Process Graphics Translation Table.

DMA: Direct Memory Access. For example, a memory access which is not facilitated by the host operating system.

GPU: Graphics Processing Unit.

IOMMU: Input/Output Memory Management Unit.

VT-d: “Virtualization Technology for Directed I/O”, which is an implementation of an IOMMU in some CPUs.

TLB: Translation Lookaside Buffer.

ATS: Address Translation Service.

In the following description, numerous specific details are set forth such as examples of specific systems, languages, components, etc., in order to provide a thorough understanding of the various embodiments. It will be apparent, however, to one skilled in the art that these specific details need not be employed to practice the disclosed embodiments. In other instances, well known materials or methods have not been described in detail in order to avoid unnecessarily obscuring the disclosed embodiments.

In addition to various hardware components depicted in the figures and described herein, embodiments further include various operations which are described below. The operations described in accordance with such embodiments may be performed by hardware components or may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor programmed with the instructions to perform the operations. Alternatively, the operations may be performed by a combination of hardware and software, including software instructions that perform the operations described herein via memory and one or more processors of a computing platform.

Embodiments also relate to a system or apparatus for performing the operations herein. The disclosed system or apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a non-transitory computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, flash, NAND, solid state drives (SSDs), CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing non-transitory electronic instructions, each coupled to a computer system bus. In one embodiment, a non-transitory computer readable storage medium having instructions stored thereon causes one or more processors within a system to perform the methods and operations which are described herein. In another embodiment, the instructions to perform such methods and operations are stored upon a non-transitory computer readable medium for later execution.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus, nor are embodiments described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the embodiments as described herein.

FIG. 1A illustrates an exemplary architecture of a graphics processor unit (GPU) 100 in accordance with which embodiments may operate.

In one embodiment, the graphics processor unit 100 is embodied within a silicon integrated circuit. In one embodiment, the graphics processor unit 100 includes a shared graphics memory interface 155 to receive a prepared lookup table 156 from a hypervisor, and a graphics memory 116 to store the prepared lookup table 156 therein as a stored lookup table 117. In such an embodiment, the prepared lookup table 156 provides a mapping of Guest physical Frame Numbers (GFNs) 158 to Machine Frame Numbers (MFNs) 159.

In one embodiment, the graphics processor unit 100 further includes an address translation service 115 unit to receive a Guest physical Frame Number for translation. In one embodiment, the graphics processor unit 100 includes and utilizes a Translation Lookaside Buffer (TLB) 110B to retrieve a Machine Frame Number corresponding to the received Guest physical Frame Number from the prepared lookup table 156 to fulfill the translation. The address translation service 115 may retrieve the GFN to MFN mappings through the Translation Lookaside Buffer 110B integrated within the graphics processor unit 100. In one embodiment, the TLB provides an indexable cache on behalf of the address translation service 115 of the graphics processor unit 100.

In one embodiment, the graphics processor unit 100 performs address translation services 115 on behalf of one or more guest machines operating within a tablet computing device or a smartphone.

In one embodiment, the graphics processor unit 100 further includes one or more graphics virtualization registers 114. For example, the one or more graphics virtualization registers 114 may inform the graphics processor unit 100 that a guest machine has been assigned to the graphics processor unit 100 by a hypervisor. The one or more graphics virtualization registers 114 may further identify where the prepared lookup table 156 is stored in the graphics memory 116 as a stored lookup table 117. For example, the one or more graphics virtualization registers 114 may identify where the lookup table 117 is located for an assigned guest machine.

In one embodiment, the graphics processor unit 100 further includes a graphics virtualization determination 113 component integrated therein. For example, the graphics virtualization determination 113 component may determine whether or not the graphics processor unit 100 is operating within a virtualization environment on behalf of one or more guest machines. In one embodiment, the graphics processor unit 100 engages the address translation service 115 when operating within the virtualization environment, as determined by the graphics virtualization determination 113 component; otherwise, the address translation service 115 is bypassed.
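The engage-or-bypass decision described above can be sketched in C as follows. This is a minimal illustrative model, not the GPU's actual register interface: the struct `gfx_virt_regs`, its fields, and the functions `in_virtualization_env` and `resolve_frame` are hypothetical names chosen for this sketch, and the lookup table is assumed to be a flat array indexed by GFN.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the graphics virtualization registers 114.
 * Field names are illustrative assumptions, not a documented register layout. */
struct gfx_virt_regs {
    bool enable;                  /* set by the hypervisor when a guest is assigned */
    const uint32_t *lookup_table; /* location of lookup table 117 in graphics memory */
    size_t table_entries;         /* number of GFN -> MFN entries in the table */
};

/* Graphics virtualization determination 113: is the GPU serving a guest? */
static bool in_virtualization_env(const struct gfx_virt_regs *regs)
{
    return regs->enable && regs->lookup_table != NULL;
}

/* Returns true and writes *mfn on success; false if the GFN has no entry. */
bool resolve_frame(const struct gfx_virt_regs *regs, uint32_t gfn, uint32_t *mfn)
{
    if (!in_virtualization_env(regs)) {
        *mfn = gfn;                     /* bypass: the ATS is not engaged */
        return true;
    }
    if (gfn >= regs->table_entries)
        return false;                   /* no mapping for this guest frame */
    *mfn = regs->lookup_table[gfn];     /* address translation service 115 */
    return true;
}
```

With the registers cleared, a frame number passes through unchanged. Once the hypervisor sets `enable` and points the registers at a table, the same call returns the mapped machine frame.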

FIG. 1B illustrates an alternative exemplary architecture 101 in which embodiments may operate. The newly introduced GPU-based address translation service (ATS) extends the existing logical-to-physical mapping of the GPU to support translating a guest physical address to a corresponding machine physical address.

As depicted, an extension to the graphics device 105 is provided which feeds into the graphics memory (GM) range offset removal 106 component. As shown, the graphics memory addresses are zero (“0”) based; that is, they begin at zero and proceed to the maximum allocated region or space. Graphics memory-capable internal functions and cache 107 is depicted as being communicably interfaced between the GM range offset removal 106 component and the tiled address determination 112 component. Fence registers surface parameters 108 feed into the tiled address determination 112 component, which in turn is communicably interfaced to the logical memory mapping component 111. The address tiling logic 109 is interfaced intermediately.

Graphics virtualization registers 114 feed into the graphics virtualization determination 113 component, which in turn is communicably interfaced with the address translation service (ATS) noted above. Graphics virtualization registers 114 are used to let the GPU know whether the graphics adapter has been assigned to a VM and to inform it as to where the lookup table is located. The graphics virtualization determination 113 component is used to determine whether the graphics adapter is working in a virtualization environment. If the graphics adapter is working in a virtualization environment, it calls the address translation service 115 to convert the guest physical address to a machine physical address. The address translation service 115 retrieves the machine physical address from the lookup table according to the guest physical address.

Translation lookaside buffers (TLBs) 110B are communicably interfaced with both the address translation service 115 and the lookup table 117 within the graphics memory 116 of the depicted architecture 101. In accordance with one embodiment, the lookup table 117 is configured by a hypervisor communicably interfaced with the architecture 101 of the GPU and can cover zero to four gigabytes of guest memory space. The lookup table 117 maps the GFN (Guest physical Frame Number) to the MFN (Machine Frame Number).

Translation lookaside buffers (TLBs) 110B perform a graphics memory access on behalf of the address translation service 115. A translation lookaside buffer or TLB, as depicted at 110A and 110B, is a processor cache used to improve virtual address translation speed. TLBs 110A and 110B may be implemented as content-addressable memory (CAM) in which the CAM search key is the virtual address and the search result is a physical address; in accordance with one embodiment, this yields the desired GFN to MFN address translation (e.g., a guest physical address to a machine physical address).

Where the requested address is present in the TLB 110A or 110B, the CAM search yields a match quickly and the retrieved physical address can be used to access memory. For example, the returned machine physical address may then be utilized to access a physical address of the underlying hardware, even though the original instruction provided a virtualized address associated with a guest physical address. Where the requested address is not present, the address translation service 115 may instead conduct a page walk where necessary, or compute the address when feasible.
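The hit and miss behavior described above can be sketched as a small TLB placed in front of the GFN-indexed lookup table. This is a simplification made for brevity: a hardware CAM is fully associative, while this sketch uses a direct-mapped array. The size `TLB_ENTRIES`, the struct `ats`, the hit/miss counters, and the function `ats_translate` are all hypothetical. On a miss, the sketch reads the lookup table directly and refills the slot, which stands in for the page walk or computed address mentioned in the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 16u  /* assumed size; the real TLB geometry is not specified */

struct tlb_entry { bool valid; uint32_t gfn; uint32_t mfn; };

struct ats {
    struct tlb_entry tlb[TLB_ENTRIES];
    const uint32_t *lookup_table;  /* GFN-indexed table in graphics memory */
    unsigned hits, misses;         /* counters for illustration only */
};

/* Translate a GFN, consulting the TLB first and the lookup table on a miss.
 * Assumes gfn is within the bounds of lookup_table. */
uint32_t ats_translate(struct ats *a, uint32_t gfn)
{
    struct tlb_entry *e = &a->tlb[gfn % TLB_ENTRIES];  /* direct-mapped slot */
    if (e->valid && e->gfn == gfn) {
        a->hits++;
        return e->mfn;                    /* fast path: cached translation */
    }
    a->misses++;
    uint32_t mfn = a->lookup_table[gfn];  /* graphics memory access on a miss */
    e->valid = true;                      /* refill (and possibly evict) the slot */
    e->gfn = gfn;
    e->mfn = mfn;
    return mfn;
}
```

Because this sketch is direct-mapped, two GFNs that share a slot (for example 5 and 21 with 16 entries) evict each other. A fully associative CAM would avoid that particular conflict.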

Once translated by the address translation service 115, a machine physical address may be utilized to access main memory 118. Main memory 118 is in communication with global translation table (GTT) 119, which is in turn communicably interfaced with TLBs 110A, for example, responsive to a PTE fetch operation (e.g., a Page Table Entries fetch operation).

FIG. 2A depicts interactions 200 between a hypervisor 205 and a GPU 220 which implements address translation services in accordance with the disclosed embodiments.

Graphics adapter pass-through is depicted, in which a GPU is used to perform address translation for the graphics pass-through. Practice of such an embodiment may boost the overall graphics performance for virtual machines on a platform that does not support VT-d. In particular, a lookup table within the GPU is used to perform address translation, resulting in faster performance than VT-d based solutions and vastly superior performance compared with software-based solutions.

The depicted interactions 200 show how to set up the lookup table in the graphics adapter in accordance with one embodiment. For example, when a user wants to assign a physical graphics adapter to a virtual machine, the hypervisor 205 issues a single set-up-lookup-table session to transfer the entire 4K-aligned guest-to-physical address mapping to the GPU 220.

In an IA32 architecture based machine, a 1M*4 byte memory space (i.e., 4 megabytes) is large enough to cover the entire 4G (4 gigabytes) of guest memory space: dividing 4 gigabytes into 4 KB pages yields 1M guest frames, and each frame requires one 4-byte MFN entry. Using the lookup table 117, the GPU 220 may calculate the final machine address using the following formula:

MFN = *((unsigned int *)(pPhysicalGraphicsMemory + (GFN << 2)))
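The table sizing and the lookup formula can be checked with a short C sketch. It assumes 4 KB pages, 4-byte entries (the width of `unsigned int` on IA32), and a graphics-memory base pointer that addresses bytes. Under that last assumption, `(GFN << 2)` is the byte offset `GFN * 4` of the entry. The helper names `lookup_table_bytes` and `lookup_mfn` are hypothetical. The sketch reads the entry with `memcpy` rather than the pointer cast so that it stays well-defined C even if the address is unaligned.

```c
#include <stdint.h>
#include <string.h>

/* Sizing assumptions for an IA32 guest: 4 KB pages, one 4-byte MFN per GFN. */
#define PAGE_SHIFT  12u
#define ENTRY_BYTES 4u

/* 4 GB / 4 KB = 1M frames; 1M frames * 4 bytes = 4 MB of lookup table. */
_Static_assert(((UINT64_C(1) << 32) >> PAGE_SHIFT) == (UINT64_C(1) << 20),
               "a 4 GB guest has 1M frames");
_Static_assert((((UINT64_C(1) << 32) >> PAGE_SHIFT) * ENTRY_BYTES) == (UINT64_C(4) << 20),
               "the lookup table for a 4 GB guest is 4 MB");

/* Bytes of lookup table needed for a guest of guest_bytes, rounded up to
 * whole pages. */
uint64_t lookup_table_bytes(uint64_t guest_bytes)
{
    uint64_t frames = (guest_bytes + (UINT64_C(1) << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
    return frames * ENTRY_BYTES;
}

/* The formula as a function. pPhysicalGraphicsMemory is assumed to be a byte
 * pointer to the start of the lookup table, so (gfn << 2) == gfn * 4 is the
 * byte offset of the 4-byte entry for gfn. */
uint32_t lookup_mfn(const unsigned char *pPhysicalGraphicsMemory, uint32_t gfn)
{
    uint32_t mfn;
    /* memcpy reads the same 4 bytes as the (unsigned int *) cast but avoids
     * alignment and strict-aliasing hazards. */
    memcpy(&mfn, pPhysicalGraphicsMemory + ((size_t)gfn << 2), sizeof mfn);
    return mfn;
}
```

For a full 4 GB guest, `lookup_table_bytes(1ull << 32)` returns 4194304, which is the 4 megabytes stated above.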

Beginning from the hypervisor 205 on the left hand side, enable virtualization at block 206 is initiated, which communicates a configuration 225 to the GPU 220. The GPU 220 on the right hand side configures virtualization registers at block 207 (e.g., passing the configuration 225 to set “vReg.enable=1”), and a success 226 message is responsively communicated from the GPU 220 to the hypervisor 205.

The hypervisor 205 then prepares the lookup table at block 208. The hypervisor 205 at block 210 transfers the lookup table 117 to the GPU 220. The GPU 220 at block 211 issues one DMA request to fetch the lookup table 117 and stores the lookup table 117 in graphics memory. Upon successful completion, the GPU 220 communicates a response 228 back to the hypervisor 205 indicating completion.
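The set-up sequence of blocks 206 through 211 can be modeled as a few C functions. This is a simplified single-process sketch, not a driver interface. The struct `gpu`, the functions `gpu_configure`, `gpu_fetch_table`, and `hypervisor_assign_gpu`, and the size `MAX_FRAMES` are hypothetical. `memcpy` stands in for the DMA request, and the boolean return values stand in for the success 226 and response 228 messages. The GPU refuses the table until virtualization has been enabled, which mirrors the ordering shown in FIG. 2A.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_FRAMES 1024u  /* assumed and kept small; a full 4 GB guest needs 1M entries */

/* Simplified GPU-side state; all names are hypothetical. */
struct gpu {
    struct { int enable; } vReg;          /* graphics virtualization register */
    uint32_t graphics_memory[MAX_FRAMES]; /* where the lookup table is stored */
    uint32_t table_entries;
};

/* Blocks 206/207: the hypervisor sends configuration 225; the GPU sets
 * vReg.enable = 1 and reports success 226. */
bool gpu_configure(struct gpu *g)
{
    g->vReg.enable = 1;
    return true;
}

/* Block 211: the GPU fetches the table (memcpy stands in for the DMA request)
 * and reports completion (response 228). */
bool gpu_fetch_table(struct gpu *g, const uint32_t *host_table, uint32_t n)
{
    if (!g->vReg.enable || n > MAX_FRAMES)
        return false;
    memcpy(g->graphics_memory, host_table, n * sizeof *host_table);
    g->table_entries = n;
    return true;
}

/* Hypervisor side. The caller has already prepared gfn_to_mfn (block 208);
 * this enables virtualization and then transfers the table (block 210). */
bool hypervisor_assign_gpu(struct gpu *g, const uint32_t *gfn_to_mfn, uint32_t n)
{
    if (!gpu_configure(g))
        return false;
    return gpu_fetch_table(g, gfn_to_mfn, n);
}
```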

Many hypervisors support advanced memory management mechanisms. Implementation of the disclosed embodiments provides support for a lookup table update transaction. Graphics sharing among VMs is also supported by distinguishing among the virtual machines within the lookup table 117.
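One way to distinguish among virtual machines within the lookup table is to give each VM its own slice of entries. The description does not specify a table layout, so the per-VM slicing below is an assumption made for illustration only. `MAX_VMS`, `FRAMES_PER_VM`, `struct shared_table`, and `shared_translate` are hypothetical names. With this layout, the same GFN in two different guests resolves to different machine frames, and a VM without a valid entry gets no translation.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_VMS       4u   /* assumed limits for illustration */
#define FRAMES_PER_VM 256u

/* One possible layout (an assumption, not a specified format): the lookup
 * table is split into per-VM slices indexed by VM id and then by GFN. */
struct shared_table {
    uint32_t mfn[MAX_VMS][FRAMES_PER_VM];
    bool     valid[MAX_VMS][FRAMES_PER_VM];
};

/* Translate (vm_id, gfn) to an MFN; fails for unknown VMs or unmapped frames. */
bool shared_translate(const struct shared_table *t, uint32_t vm_id,
                      uint32_t gfn, uint32_t *mfn)
{
    if (vm_id >= MAX_VMS || gfn >= FRAMES_PER_VM || !t->valid[vm_id][gfn])
        return false;
    *mfn = t->mfn[vm_id][gfn];
    return true;
}
```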

At block 212, the hypervisor 205 initiates a lookup table update (as required) and communicates the update 229 to the GPU 220. At block 213, the GPU 220 performs the lookup table update (as required) responsive to receiving the update 229. The GPU 220 then communicates a response 230 back to the hypervisor 205 indicating completion of the lookup table update.
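The update transaction of blocks 212 and 213 can be sketched as follows. The message format `struct lut_update` and the function `gpu_apply_update` are hypothetical. The sketch also makes two assumptions that the description does not state. First, it validates the whole update before applying any of it, so that it behaves as a transaction. Second, it invalidates any cached TLB entry for a changed GFN, so that a stale MFN is not served after the update.

```c
#include <stdbool.h>
#include <stdint.h>

#define FRAMES      64u  /* assumed sizes for illustration */
#define TLB_ENTRIES 8u

struct tlb_entry { bool valid; uint32_t gfn, mfn; };

struct gpu_ats {
    uint32_t table[FRAMES];            /* lookup table in graphics memory */
    struct tlb_entry tlb[TLB_ENTRIES]; /* direct-mapped cache of translations */
};

/* One entry of an update 229 message (hypothetical format). */
struct lut_update { uint32_t gfn, new_mfn; };

/* Block 213: apply all updates or none, dropping any cached translation for
 * each changed GFN. The invalidation step is an assumption of this sketch. */
bool gpu_apply_update(struct gpu_ats *g, const struct lut_update *u, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        if (u[i].gfn >= FRAMES)
            return false;              /* reject the whole transaction */
    for (unsigned i = 0; i < n; i++) {
        g->table[u[i].gfn] = u[i].new_mfn;
        struct tlb_entry *e = &g->tlb[u[i].gfn % TLB_ENTRIES];
        if (e->valid && e->gfn == u[i].gfn)
            e->valid = false;          /* force a re-read from the table */
    }
    return true;                       /* response 230: update complete */
}
```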

FIG. 2B depicts logical memory address to guest physical address translation 201 in accordance with the disclosed embodiments. As noted above, overhead in a software-based address translation service has been found to be significant (e.g., approximately 90% of total overhead), and performance suffers dramatically (e.g., 3D graphics rendering was found to achieve approximately 40% of the performance of a native hardware solution). Thus, practice of the disclosed embodiments may offload overhead from a software-based solution onto a GPU which implements address translation services. In accordance with one embodiment, address translation may utilize two distinct mechanisms.

In one embodiment, a first mechanism is translation of logical memory addresses to guest physical addresses. Legacy mechanisms may be utilized to achieve this translation. For example, if the GPU is not in a virtualization environment, it may skip the GPU's address translation service and access the main memory using MFN=GFN. However, if the GPU is operating in a virtualization environment, then the GPU's address translation service is called to perform an address translation according to the formula:

MFN = *((unsigned int *)(pPhysicalGraphicsMemory + (GFN << 2)))

As depicted, an offset into a 4 KB page 241 is provided at bits 0 to 11 and a logical page number 240 is provided at bits 12 to 31. The logical page number 240 is communicated to TLB 243, which is interfaced with GTT/PPGTT 242 (i.e., the global graphics translation table and the per-process graphics translation table). From TLB 243, the 36-bit addressing extension 244 is formed at bits 32 to 35 and the physical page number 245 is formed at bits 12 to 31, replacing the logical page number 240. The offset into a 4 KB page 246 is carried down, replacing the offset into a 4 KB page 241 at bits 0 to 11, thus completing the logical memory address to guest physical address translation 201.
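The bit manipulation in FIG. 2B can be sketched in C. The sketch omits TLB 243 and models GTT/PPGTT 242 as a flat array indexed by logical page number. It assumes that each entry holds a 24-bit physical page number whose top 4 bits become the addressing extension at bits 32 to 35. That entry format is an assumption made for illustration, not a documented GTT entry layout, and `logical_to_guest_physical` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical flat GTT: index = logical page number, value = 24-bit physical
 * page number whose top 4 bits become address bits 32..35. TLB 243 is omitted. */
uint64_t logical_to_guest_physical(const uint32_t *gtt, uint32_t logical)
{
    uint32_t offset = logical & 0xFFFu;        /* bits 0..11: offset into 4 KB page */
    uint32_t lpn    = logical >> 12;           /* bits 12..31: logical page number */
    uint32_t ppn24  = gtt[lpn] & 0xFFFFFFu;    /* assumed 24-bit entry */
    uint64_t ext    = ppn24 >> 20;             /* 4-bit addressing extension */
    uint64_t ppn    = ppn24 & 0xFFFFFu;        /* 20-bit physical page number */
    return (ext << 32) | (ppn << 12) | offset; /* 36-bit guest physical address */
}
```

For example, with `gtt[1] == 0x300ABC`, logical address `0x1123` maps to extension 3, page `0x00ABC`, and offset `0x123`. The offset is carried down unchanged, as in the figure.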

Extra cache operations may additionally be utilized to accelerate the address translation operation.

Notably, when the GPU is not operating in a virtualization environment, the address translation service of the GPU need not be engaged, as the lack of a virtualization environment negates the need for performing GFN to MFN mapping.

FIG. 2C depicts Guest physical Frame Number (GFN) to Machine Frame Number (MFN) mapping 202 in accordance with the disclosed embodiments, for example, where the GPU is operating in a virtualization environment.

As depicted, graphics memory 116 of a GPU holds the lookup table 117, in which the guest memory ranges from zero (0) to the maximum size of an allocated region. A guest physical frame number 266 is passed in, resulting in GFN 260 at bits 12 to 31. An offset into a 4 KB page 261 is again depicted at bits 0 to 11. GFN 260 is passed to TLB 263, which is communicably interfaced with the lookup table 117 in the graphics memory 116. MFN 264 is then responsively provided at bits 12 to 31, and an offset into a 4 KB page 265 is carried down, replacing the offset into a 4 KB page 261 at bits 0 to 11, thus completing the GFN to MFN mapping 202.
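The FIG. 2C mapping applies to a full guest physical address in the following way. The 20-bit GFN selects a lookup-table entry, the returned MFN replaces bits 12 to 31, and the 12-bit page offset passes through unchanged. The sketch below omits TLB 263 and reads the lookup table directly. It assumes one 32-bit MFN per 4 KB guest frame, and `gpa_to_mpa` is a hypothetical name. It returns a 64-bit value so that a larger MFN is not truncated. When the MFN is below 2^20, the result fits in 32 bits, as drawn in the figure.

```c
#include <stdint.h>

/* Translate a 32-bit guest physical address to a machine physical address
 * using a GFN-indexed lookup table (one 32-bit MFN per 4 KB guest frame).
 * Assumes the GFN is within the bounds of lookup_table. */
uint64_t gpa_to_mpa(const uint32_t *lookup_table, uint32_t gpa)
{
    uint32_t gfn    = gpa >> 12;     /* bits 12..31: guest physical frame number */
    uint32_t offset = gpa & 0xFFFu;  /* bits 0..11: carried down unchanged */
    uint64_t mfn    = lookup_table[gfn];
    return (mfn << 12) | offset;     /* MFN replaces the GFN bits */
}
```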

FIG. 3 illustrates an exemplary system 300 into which a graphics processor unit 100 which implements address translation services 115 may be integrated, installed, or configured, in accordance with one embodiment. System 300 includes a main memory 118 and a central processor unit 396 without VT-d integrated therein. System 300 includes communication bus(es) 315 to transfer data within system 300 and a hypervisor 390 to manage one or more guest machines (VMs) 338.

Depicted separately is graphics processor unit (GPU) 100, which may be manufactured and sold separately from the system 300 but later configured and integrated with such a system 300. In accordance with one embodiment, such a system includes: main memory 118 having a plurality of machine physical addresses; the graphics processor unit 100 having graphics memory 116 therein; an address translation service 115 integrated with the graphics processor unit 100; and the hypervisor 390 to manage one or more guest machines 338. In such an embodiment, the hypervisor 390 configures a lookup table 117 within the graphics memory 116 of the graphics processor unit 100. In such an embodiment, the address translation service 115 of the graphics processor unit 100 translates a guest physical address for one of the one or more guest machines 338 to a corresponding machine physical address within the main memory 118 of the system.

In one embodiment, such a system 300 further includes one or more graphics virtualization registers 114 within the GPU 100, in which the one or more graphics virtualization registers 114 inform the graphics processor unit 100 that one of the guest machines 338 has been assigned to the graphics processor unit 100 by the hypervisor 390. Such a system 300 may further include a graphics virtualization determination 113 component integrated within the graphics processor unit 100 to determine whether or not the graphics processor unit 100 is operating within a virtualization environment on behalf of one of the guest machines 338. When operating within the virtualization environment, the address translation service 115 is engaged as set forth above in FIG. 2C, depicting Guest physical Frame Number (GFN) to Machine Frame Number (MFN) mapping 202. However, when not operating within the virtualization environment, the system 300 implements logical memory address to guest physical address translation without engaging the address translation service 115 of the graphics processor unit 100, as set forth above in FIG. 2B, depicting logical memory address to guest physical address translation 201.

In one embodiment, the hypervisor 390 engages the address translation service 115 of the graphics processor unit 100 by configuring one or more graphics virtualization registers 114 to inform the graphics processor unit 100 that one of the guest machines 338 has been assigned to the graphics processor unit 100 by the hypervisor 390. The hypervisor 390 may include or be implemented via a virtual machine manager, and the one or more guest machines 338 may include or be implemented via one or more virtual machines.

In one embodiment, the hypervisor 390 passes a Guest physical Frame Number (GFN) to the graphics processor unit 100 for translation by the address translation service 115 to a Machine Frame Number (MFN). In such an embodiment, the GFN represents the guest physical address for one of the one or more guest machines 338, and the MFN represents the machine physical address within the main memory 118 of the system corresponding to the guest physical address. In one embodiment, the address translation service 115 retrieves Guest physical Frame Number (GFN) to Machine Frame Number (MFN) mappings from the lookup table 117 in the graphics memory 116.

In one embodiment, the system 300 includes a separate and distinct central processor unit 396 communicably interfaced with the graphics processor unit 100 via a system bus 315. In such an embodiment, the central processor unit 396 lacks dedicated hardware circuitry to perform address translation of guest physical addresses to machine physical addresses and must therefore either rely upon the address translation service 115 of the GPU 100 described herein or, absent such a GPU 100, fall back on less efficient software translation.

In one embodiment, the system 300 utilizes the graphics processor unit 100 as a microprocessor within a tablet computing device or a smart phone, or as one of a plurality of microprocessors integrated within the tablet computing device or the smartphone. For example, such a tablet computing device or smartphone may include the central processor unit 396, which lacks an IOMMU (input/output memory management unit) or VT-d capability.

In one embodiment, the graphics memory 116 includes shared graphics memory 116, and the hypervisor 390 configures the lookup table 117 within the graphics memory by writing the lookup table 117 directly into the shared graphics memory 116 of the graphics processor unit 100, for example, via a shared graphics memory interface 155. In an alternative embodiment, the hypervisor 390 configures the lookup table 117 within the graphics memory 116 by instructing the graphics processor unit 100 to retrieve and store the lookup table 117; the graphics processor unit 100 responsively issues a Direct Memory Access (DMA) request to fetch the lookup table 117 and then stores the lookup table 117 in the graphics memory 116.
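The two ways of installing the lookup table 117 can be contrasted in a short sketch. Here graphics memory is modelled as a flat list of MFN slots indexed by GFN and main memory as a dictionary keyed by address; `shared_write`, `fetch_via_dma`, and `_dma_read` are hypothetical names, and the DMA read is simulated by a copy rather than a real bus transaction.

```python
class Gpu:
    """Toy GPU whose graphics memory holds one MFN slot per GFN (assumed layout)."""

    def __init__(self, num_slots: int):
        self.graphics_memory = [None] * num_slots

    def _store(self, table: dict) -> None:
        for gfn, mfn in table.items():
            self.graphics_memory[gfn] = mfn

    def shared_write(self, table: dict) -> None:
        # Path 1: the hypervisor writes the table directly through the
        # shared graphics memory interface.
        self._store(table)

    def fetch_via_dma(self, main_memory: dict, table_addr: int) -> None:
        # Path 2: the hypervisor only instructs the GPU; the GPU issues a
        # DMA read of the staged table and stores it itself.
        table = self._dma_read(main_memory, table_addr)
        self._store(table)

    @staticmethod
    def _dma_read(main_memory: dict, addr: int) -> dict:
        # Simulated DMA: copy the table staged at addr in main memory.
        return dict(main_memory[addr])
```

In the first path the hypervisor does the copying; in the second the GPU does, which frees the CPU at the cost of the GPU needing DMA access to the staged table.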

In one embodiment, the hypervisor 390 issues a lookup table update to the graphics processor unit 100, and the graphics processor unit 100 responsively updates the lookup table 117 in the graphics memory 116.
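A sketch of handling a lookup table update follows. It assumes the GPU caches recent translations (as in the TLB-based embodiments) and therefore discards any cached entry for an updated GFN so that a stale MFN is never returned; passing `None` as an MFN to revoke a mapping is a convention invented for this example.

```python
class GpuTranslation:
    """Lookup table plus a cache of recent GFN -> MFN results (assumed TLB)."""

    def __init__(self, table: dict):
        self.table = dict(table)  # lookup table in graphics memory
        self.tlb = {}             # cached translations

    def translate_gfn(self, gfn: int) -> int:
        if gfn not in self.tlb:
            self.tlb[gfn] = self.table[gfn]  # fill on miss
        return self.tlb[gfn]

    def apply_update(self, updates: dict) -> None:
        """Apply a hypervisor-issued lookup table update."""
        for gfn, mfn in updates.items():
            if mfn is None:
                self.table.pop(gfn, None)  # mapping revoked
            else:
                self.table[gfn] = mfn      # mapping added or changed
            self.tlb.pop(gfn, None)        # discard any stale cached entry
```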

FIG. 4 is a flow diagram 400 illustrating a method for implementing GPU (Graphics Processing Unit) accelerated address translation for graphics virtualization in accordance with described embodiments. Method 400 may be performed by processing logic that may include hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device to perform the methodologies and operations described herein), or a combination of the two. Some of the blocks and/or operations of method 400 are optional in accordance with certain embodiments. The numbering of the blocks presented is for the sake of clarity and is not intended to prescribe an order of operations in which the various blocks must occur.

Method 400 begins with processing logic managing one or more guest machines on a system (block 405).

At block 410, processing logic configures a lookup table with a mapping of guest physical addresses for the one or more guest machines to corresponding machine physical addresses.

At block 415, processing logic enables virtualization in the GPU by communicating a configuration request from the hypervisor to the GPU.

At block 420, processing logic internal to the hypervisor prepares the lookup table.

At block 425, processing logic transfers the prepared lookup table to the GPU by writing the lookup table into shared graphics memory of the GPU or instructing the GPU to retrieve and store the prepared lookup table via a DMA request.

At block 430, processing logic receives an access request to a main memory of the system from one of the guest machines.

At block 435, processing logic engages an address translation service internal to the GPU by passing the guest physical address to the GPU.

At block 440, processing logic passes a GFN from the hypervisor to the GPU requesting translation to an MFN.

At block 445, processing logic retrieves a GFN to MFN mapping through a Translation Lookaside Buffer of the GPU.

At block 450, processing logic translates the guest physical address to a corresponding machine physical address (e.g., the GPU performs the requested GFN to MFN translation).

At block 455, processing logic issues a lookup table update to the GPU, causing the GPU to update the lookup table.
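Blocks 405 through 455 can be tied together in a single end-to-end sketch. It assumes 4 KiB pages, models the GPU's Translation Lookaside Buffer as a small least-recently-used cache built on `collections.OrderedDict`, and uses hypothetical `Gpu` and `Hypervisor` classes; the comments map each step to the corresponding block of method 400, which, as noted above, need not execute in numbered order.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # assumed 4 KiB pages


class Gpu:
    def __init__(self, tlb_entries: int = 4):
        self.virt_enabled = False
        self.lookup_table = {}        # held in graphics memory
        self.tlb = OrderedDict()      # LRU cache of GFN -> MFN
        self.tlb_entries = tlb_entries

    def translate(self, gfn: int) -> int:
        # Blocks 440-450: translate a GFN, consulting the TLB first (block 445).
        if gfn in self.tlb:
            self.tlb.move_to_end(gfn)          # hit: mark most recently used
            return self.tlb[gfn]
        mfn = self.lookup_table[gfn]           # miss: read the lookup table
        self.tlb[gfn] = mfn
        if len(self.tlb) > self.tlb_entries:
            self.tlb.popitem(last=False)       # evict least recently used
        return mfn

    def update(self, gfn: int, mfn: int) -> None:
        # Block 455: apply a lookup table update and drop the stale TLB entry.
        self.lookup_table[gfn] = mfn
        self.tlb.pop(gfn, None)


class Hypervisor:
    def __init__(self, gpu: Gpu):
        self.gpu = gpu
        self.p2m = {}                          # per-guest GFN -> MFN tables

    def add_guest(self, guest_id: int, mapping: dict) -> None:
        # Blocks 405-410: manage the guest and record its address mapping.
        self.p2m[guest_id] = dict(mapping)

    def assign_to_gpu(self, guest_id: int) -> None:
        # Block 415: enable virtualization; blocks 420-425: prepare and
        # transfer the lookup table (a copy stands in for the transfer).
        self.gpu.virt_enabled = True
        self.gpu.lookup_table = dict(self.p2m[guest_id])
        self.gpu.tlb.clear()

    def handle_access(self, gpa: int) -> int:
        # Blocks 430-435: a guest access arrives; hand its GFN to the GPU.
        if not self.gpu.virt_enabled:
            return gpa
        gfn, offset = gpa >> PAGE_SHIFT, gpa & ((1 << PAGE_SHIFT) - 1)
        return (self.gpu.translate(gfn) << PAGE_SHIFT) | offset
```

With a two-entry TLB, translating GFNs `0x10`, `0x11`, and `0x12` in turn evicts `0x10`; a subsequent `update(0x11, 0x99)` makes the next access to `0x11000` resolve to `0x99000` rather than to a stale cached value.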

In accordance with one embodiment, a non-transitory computer readable storage medium stores instructions that, when executed by processors (e.g., a CPU and also a GPU) in a computing system, cause the computing system to perform one or more of the operations set forth in the flow diagram 400. For example, the instructions may cause the processors of the system to perform operations including: managing, via a hypervisor, one or more guest machines on a system; configuring a lookup table with a mapping of guest physical addresses for the one or more guest machines to corresponding machine physical addresses; transferring the lookup table to a graphics memory of a graphics processor unit; receiving an access request to a main memory of the system from one of the guest machines, wherein, in one embodiment, the access request specifies a guest physical address; engaging an address translation service internal to the graphics processor unit by passing the guest physical address to the graphics processor unit; and translating, via the address translation service of the graphics processor unit, the guest physical address to a corresponding machine physical address.

FIG. 5 illustrates a diagrammatic representation of a machine 500 in the exemplary form of a computer system, in accordance with one embodiment, within which a set of instructions, for causing the machine/computer system 500 to perform any one or more of the methodologies discussed herein, may be executed. In alternative embodiments, the machine may be connected (e.g., networked) to other machines in a Local Area Network (LAN), an intranet, an extranet, or the Internet. The machine may operate in the capacity of a server or a client machine in a client-server network environment, as a peer machine in a peer-to-peer (or distributed) network environment, or as a server or series of servers within an on-demand service environment. Certain embodiments of the machine may be in the form of a personal computer (PC), a tablet PC, a smart phone, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, switch or bridge, computing system, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines (e.g., computers) that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.

The exemplary computer system 500 includes a processor 502 without VT-d, a main memory 504 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM) or Rambus DRAM (RDRAM), etc., static memory such as flash memory, static random access memory (SRAM), volatile but high-data rate RAM, etc.), and a secondary memory 518 (e.g., a persistent storage device including hard disk drives), which communicate with each other via a bus 530. Main memory 504 includes a hypervisor 524 to manage virtual machines which utilize GPU 501 and processor 502. Processor 502 operates in conjunction with the processing logic 526 to perform the methodologies discussed herein. In one embodiment, GPU 501 utilizes a graphics memory 525 and address translation 527, each internal to GPU 501.

The computer system 500 may further include a network interface card 508. The computer system 500 also may include a user interface 510 (such as a video display unit, a liquid crystal display (LCD), or a cathode ray tube (CRT)), an alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse), and a signal generation device 516 (e.g., an integrated speaker). The computer system 500 may further include peripheral devices 536 (e.g., wireless or wired communication devices, memory devices, storage devices, audio processing devices, video processing devices, etc.).

The secondary memory 518 may include a non-transitory machine-readable or computer readable storage medium 531 on which is stored one or more sets of instructions (e.g., software 522) embodying any one or more of the methodologies or functions described herein. The software 522 may also reside, completely or at least partially, within the main memory 504 and/or within the processor 502 without VT-d and/or GPU 501 during execution thereof by the computer system 500. The software 522 may further be transmitted or received over a network 520 via the network interface card 508.

FIG. 6 depicts a block diagram of a system 600 in accordance with one embodiment. The system 600 may include one or more processors 610, 615, which are coupled to graphics memory controller hub (GMCH) 620. The optional nature of additional processors 615 is denoted in FIG. 6 with broken lines.

Each processor 610, 615 may be some version of the processor 502. However, it should be noted that it is unlikely that integrated graphics logic and integrated memory control units would exist in the processors 610, 615. FIG. 6 illustrates that the GMCH 620 may be coupled to a memory 640 that may be, for example, a dynamic random access memory (DRAM). The DRAM may, for at least one embodiment, be associated with a non-volatile cache.

The GMCH 620 may be a chipset, or a portion of a chipset. The GMCH 620 may communicate with the processor(s) 610, 615 and control interaction between the processor(s) 610, 615 and memory 640. The GMCH 620 may also act as an accelerated bus interface between the processor(s) 610, 615 and other elements of the system 600. For at least one embodiment, the GMCH 620 communicates with the processor(s) 610, 615 via a multi-drop bus, such as a frontside bus (FSB) 695.

Furthermore, GMCH 620 is coupled to a display 645 (such as a flat panel or touchscreen display). GMCH 620 may include an integrated graphics accelerator. GMCH 620 is further coupled to an input/output (I/O) controller hub (ICH) 650, which may be used to couple various peripheral devices to system 600. Shown for example in the embodiment of FIG. 6 is an external graphics device 660, which may be a discrete graphics device coupled to ICH 650, along with another peripheral device 670.

Alternatively, additional or different processors may also be present in the system 600. For example, additional processor(s) 615 may include additional processor(s) that are the same as processor 610, additional processor(s) that are heterogeneous or asymmetric to processor 610, accelerators (such as, e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor. There can be a variety of differences between the physical resources of processors 610, 615 in terms of a spectrum of metrics of merit including architectural, micro-architectural, thermal, power consumption characteristics, and the like. These differences may effectively manifest themselves as asymmetry and heterogeneity amongst the processors 610, 615. For at least one embodiment, the various processors 610, 615 may reside in the same die package.

FIG. 7 is a block diagram of a computer system 700 in accordance with one embodiment. In particular, a block diagram of a second system 700 is depicted in accordance with an embodiment in which multiprocessor system 700 is a point-to-point interconnect system and includes a first processor 770 and a second processor 780 coupled via a point-to-point interconnect 750. Each of processors 770 and 780 may be some version of GPU 501 and/or processor 502, or one or more of the processors 610, 615.

While shown with only two processors 770, 780, it is to be understood that the scope of the present disclosure is not so limited. In other embodiments, one or more additional processors may be present in a given processor.

Processors 770 and 780 are shown including integrated memory controller units 772 and 782, respectively. Processor 770 also includes as part of its bus controller units point-to-point (P-P) interfaces 776 and 778; similarly, second processor 780 includes P-P interfaces 786 and 788. Processors 770, 780 may exchange information via a point-to-point (P-P) interface 750 using P-P interface circuits 778, 788. As shown in FIG. 7, IMCs 772 and 782 couple the processors to respective memories, namely a memory 732 and a memory 734, which may be portions of main memory locally attached to the respective processors.

Processors 770, 780 may each exchange information with a chipset 790 via individual P-P interfaces 752, 754 using point-to-point interface circuits 776, 794, 786, 798. Chipset 790 may also exchange information with a high-performance graphics circuit 738 via a high-performance graphics interface 739.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via P-P interconnect, such that either or both processors' local cache information may be stored in the shared cache if a processor is placed into a low power mode.

Chipset 790 may be coupled to a first bus 716 via an interface 796. In one embodiment, first bus 716 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third generation I/O interconnect bus.

As shown in FIG. 7, various I/O devices 714 may be coupled to first bus 716, along with a bus bridge 718 which couples first bus 716 to a second bus 720. In one embodiment, second bus 720 may be a low pin count (LPC) bus. Various devices may be coupled to second bus 720 including, for example, a keyboard and/or mouse 722, communication devices 727 and a storage unit 728 such as a disk drive or other mass storage device which may include instructions/code and data 730, in one embodiment. Further, an audio I/O 724 may be coupled to second bus 720. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 7, a system may implement a multi-drop bus or other such architecture.

FIG. 8 is a block diagram of a computer system in accordance with one embodiment. In particular, a block diagram of a system 800 is depicted in accordance with an embodiment in which processors 870, 880 may include integrated memory and I/O control logic (“CL”) 872 and 882, respectively, and intercommunicate with each other via point-to-point interconnect 850 between point-to-point (P-P) interfaces 878 and 888, respectively. Processors 870, 880 each communicate with chipset 890 via point-to-point interconnects 852 and 854 through the respective P-P interfaces 876 to 894 and 886 to 898 as shown. For at least one embodiment, the CL 872, 882 may include integrated memory controller units. CLs 872, 882 may include I/O control logic. As depicted, memories 832, 834 are coupled to CLs 872, 882, and I/O devices 814 are also coupled to the control logic 872, 882. Legacy I/O devices 815 are coupled to the chipset 890 via interface 896.

FIG. 9 depicts a tablet computing device 901 and a hand-held smartphone 902, each having circuitry integrated therein as described in accordance with the embodiments. As depicted, each of the tablet computing device 901 and the hand-held smartphone 902 includes a touch interface 903 and one or more integrated processors 904 in accordance with disclosed embodiments.

In one embodiment, the GPU 100 is a Graphics Processor Unit type microprocessor within a tablet computing device or a smart phone, or one of a plurality of integrated processors 904 within the tablet computing device 901 or a hand-held smart phone 902. For example, the GPU 100 based integrated processor 904 of a tablet computing device 901 or a hand-held smartphone 902 may implement a GPU based address translation service utilizing a lookup table within graphics memory of the GPU as described herein.

In one embodiment, the tablet computing device 901 or the hand-held smartphone 902 includes a separate and distinct central processing unit communicatively interfaced with the GPU 100 within the tablet computing device 901 or the hand-held smartphone 902, in which the central processing unit lacks dedicated hardware circuitry to perform address translation of guest physical addresses to machine physical addresses and must therefore rely upon the GPU 100, which implements address translation services.

FIG. 10 is a block diagram 1000 of an embodiment of a tablet computing device, a smart phone, or other mobile device in which touchscreen interface connectors are used. Processor 1010 performs the primary processing operations. Audio subsystem 1020 represents hardware (e.g., audio hardware and audio circuits) and software (e.g., drivers, codecs) components associated with providing audio functions to the computing device. In one embodiment, a user interacts with the tablet computing device or smart phone by providing audio commands that are received and processed by processor 1010.

Display subsystem 1030 represents hardware (e.g., display devices) and software (e.g., drivers) components that provide a visual and/or tactile display for a user to interact with the tablet computing device or smart phone. Display subsystem 1030 includes display interface 1032, which includes the particular screen or hardware device used to provide a display to a user. In one embodiment, display subsystem 1030 includes a touchscreen device that provides both output and input to a user.

I/O controller 1040 represents hardware devices and software components related to interaction with a user. I/O controller 1040 can operate to manage hardware that is part of audio subsystem 1020 and/or display subsystem 1030. Additionally, I/O controller 1040 illustrates a connection point for additional devices that connect to the tablet computing device or smart phone through which a user might interact. In one embodiment, I/O controller 1040 manages devices such as accelerometers, cameras, light sensors or other environmental sensors, or other hardware that can be included in the tablet computing device or smart phone. The input can be part of direct user interaction, as well as providing environmental input to the tablet computing device or smart phone.

In one embodiment, the tablet computing device or smart phone includes power management 1050 that manages battery power usage, charging of the battery, and features related to power saving operation. Memory subsystem 1060 includes memory devices for storing information in the tablet computing device or smart phone. Connectivity 1070 includes hardware devices (e.g., wireless and/or wired connectors and communication hardware) and software components (e.g., drivers, protocol stacks) that enable the tablet computing device or smart phone to communicate with external devices. Cellular connectivity 1072 may include, for example, wireless carriers such as GSM (global system for mobile communications), CDMA (code division multiple access), TDM (time division multiplexing), or other cellular service standards. Wireless connectivity 1074 may include, for example, activity that is not cellular, such as personal area networks (e.g., Bluetooth), local area networks (e.g., WiFi), and/or wide area networks (e.g., WiMax), or other wireless communication.

Peripheral connections 1080 include hardware interfaces and connectors, as well as software components (e.g., drivers, protocol stacks) to make peripheral connections as a peripheral device (“to” 1082) to other computing devices, as well as have peripheral devices (“from” 1084) connected to the tablet computing device or smart phone, including, for example, a “docking” connector to connect with other computing devices. Peripheral connections 1080 include common or standards-based connectors, such as a Universal Serial Bus (USB) connector, DisplayPort including MiniDisplayPort (MDP), High Definition Multimedia Interface (HDMI), Firewire, etc.

FIG. 11 is a block diagram of an IP core development system in accordance with one embodiment. In particular, a block diagram illustrates the development of IP cores according to one embodiment in which storage medium 1130 includes simulation software 1120 and/or hardware or software model 1110. In one embodiment, the data representing the IP core design can be provided to the storage medium 1130 via memory 1140 (e.g., hard disk), wired connection (e.g., internet) 1150 or wireless connection 1160. The IP core information generated by the simulation tool and model can then be transmitted to a fabrication facility 1165 where it can be fabricated by a 3rd party to perform at least one instruction in accordance with at least one embodiment.

In some embodiments, one or more instructions may correspond to a first type or architecture (e.g., x86) and be translated or emulated on a processor of a different type or architecture (e.g., ARM). An instruction, according to one embodiment, may therefore be performed on any processor or processor type, including ARM, x86, MIPS, a GPU, or other processor type or architecture.

FIG. 12 illustrates an architecture emulation system in accordance with one embodiment. In particular, the architecture emulation system illustrates how an instruction of a first type is emulated by a processor of a different type, according to one embodiment in which program 1205 contains some instructions that may perform the same or substantially the same function as an instruction according to one embodiment. However, the instructions of program 1205 may be of a type and/or format that is different from or incompatible with processor 1215, meaning the instructions of the type in program 1205 may not be able to execute natively on the processor 1215. With the help of emulation logic 1210, however, the instructions of program 1205 are translated into instructions that are natively capable of being executed by the processor 1215. In one embodiment, the emulation logic is embodied in hardware. In another embodiment, the emulation logic is embodied in a tangible, machine-readable medium containing software to translate instructions of the type in the program 1205 into the type natively executable by the processor 1215. In other embodiments, emulation logic is a combination of fixed-function or programmable hardware and a program stored on a tangible, machine-readable medium. In one embodiment, the processor contains the emulation logic, whereas in other embodiments, the emulation logic exists outside of the processor and is provided by a third party. In one embodiment, the processor is capable of loading the emulation logic embodied in a tangible, machine-readable medium containing software by executing microcode or firmware contained in or associated with the processor.
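To make the emulation flow of FIG. 12 concrete, the toy sketch below translates instructions of an invented foreign instruction set (`INC`, `MOV`) into a single invented native instruction (`ADDI`) and then executes the result. Both instruction sets are hypothetical and exist only for illustration; the sketch corresponds loosely to the software embodiment of the emulation logic 1210, not to any particular processor.

```python
def translate_program(program):
    """Emulation logic: map each foreign instruction to native instructions."""
    native = []
    for op, *args in program:
        if op == "INC":                  # INC r      ->  ADDI r, r, 1
            (r,) = args
            native.append(("ADDI", r, r, 1))
        elif op == "MOV":                # MOV d, s   ->  ADDI d, s, 0
            d, s = args
            native.append(("ADDI", d, s, 0))
        else:
            raise ValueError(f"no native equivalent for {op!r}")
    return native


def run_native(native, regs):
    """Execute the translated program on a model of the native processor."""
    for op, dst, src, imm in native:
        if op != "ADDI":
            raise ValueError(f"not a native instruction: {op!r}")
        regs[dst] = regs[src] + imm
    return regs
```

Translating the whole program once and then executing it corresponds to the translate-then-execute flow described above; an interpreter that emulates one foreign instruction at a time would be an equally valid alternative.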

FIG. 13 illustrates a system to translate instructions in accordance with one embodiment. In particular, FIG. 13 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments in which the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. A program in a high level language 1302 may be compiled using an x86 compiler 1304 to generate x86 binary code 1306 that may be natively executed by a processor with at least one x86 instruction set core 1316. The processor with at least one x86 instruction set core 1316 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1304 represents a compiler that is operable to generate x86 binary code 1306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 1316. Similarly, the program in the high level language 1302 may be compiled using an alternative instruction set compiler 1308 to generate alternative instruction set binary code 1310 that may be natively executed by a processor without at least one x86 instruction set core 1314 (e.g., a processor with cores that execute the MIPS instruction set and/or that execute the ARM instruction set).

The instruction converter 1312 is used to convert the x86 binary code 1306 into code that may be natively executed by the processor without an x86 instruction set core 1314. This converted code is not likely to be the same as the alternative instruction set binary code 1310 because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1306.

While the subject matter disclosed herein has been described by way of example and in terms of the specific embodiments, it is to be understood that the claimed embodiments are not limited to the explicitly enumerated embodiments disclosed. To the contrary, the disclosure is intended to cover various modifications and similar arrangements as would be apparent to those skilled in the art. Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements. It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the disclosed subject matter is therefore to be determined in reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.

What is claimed is:
 1. A system comprising: a main memory having a plurality of machine physical addresses; a graphics processor unit having graphics memory therein; an address translation service integrated with the graphics processor unit; a hypervisor to manage one or more guest machines; wherein the hypervisor to configure a lookup table within the graphics memory of the graphics processor unit; and wherein the address translation service of the graphics processor unit to translate a guest physical address for one of the one or more guest machines to a corresponding machine physical address within the main memory.
 2. The system of claim 1, further comprising: one or more graphics virtualization registers; wherein the one or more graphics virtualization registers inform the graphics processor unit that one of the guest machines have been assigned to the graphics processor unit by the hypervisor.
 3. The system of claim 2, wherein the one or more graphics virtualization registers further identify where the lookup table is located for the assigned guest machine.
 4. The system of claim 1, further comprising: a graphics virtualization determination component integrated within the graphics processor unit; wherein the graphics virtualization determination component is to determine whether or not the graphics processor unit is operating within a virtualization environment on behalf of one of the guest machines; wherein the graphics processor unit to engage the address translation service when operating within the virtualization environment; and wherein the system to implement logical memory address to guest physical address translation without engaging the address translation service of the graphics processor unit when the graphics processor unit is not operating within the virtualization environment.
 5. The system of claim 1, wherein the hypervisor to engage the address translation service of the graphics processor unit by: configuring one or more graphics virtualization registers to inform the graphics processor unit that one of the guest machines have been assigned to the graphics processor unit by the hypervisor; and passing a Guest physical Frame Number (GFN) to the graphics processor unit for translation by the address translation service to a Machine Frame Number (MFN), wherein the GFN represents the guest physical address for one of the one or more guest machines and wherein the MFN represents the machine physical addresses within the main memory corresponding to the guest physical address.
 6. The system of claim 1, wherein: an address translation service integrated with the graphics processor unit retrieves Guest physical Frame Number (GFN) to Machine Frame Number (MFN) mappings from the lookup table in the graphics memory.
 7. The system of claim 6, wherein: the address translation service retrieves the GFN to MFN mappings through a Translation Lookaside Buffer (TLB) integrated within the graphics processor unit; and wherein the TLB provides an indexable cache on behalf of the address translation service of the graphics processor unit.
 8. The system of claim 1: wherein the hypervisor comprises a virtual machine manager; and wherein the one or more guest machines comprise one or more virtual machines.
 9. The system of claim 1, further comprising: a separate and distinct central processor unit communicably interfaced with the graphics processor unit via a system bus; wherein the central processing unit lacks dedicated hardware circuitry to perform address translation of guest physical addresses to machine physical addresses.
 10. The system of claim 1, wherein the graphics processor unit comprises a silicon integrated circuit type graphics processor unit.
 11. The system of claim 1, wherein the graphics processor unit comprises a microprocessor within a tablet computing device or a smart phone or one of a plurality of microprocessors integrated within the tablet computing device or the smartphone.
 12. The system of claim 11, wherein the tablet computing device or the smartphone comprises a separate and distinct central processing unit communicatively interfaced with the graphics processor unit within the tablet computing device or the smartphone; and wherein the central processing unit lacks dedicated hardware circuitry to perform address translation of guest physical addresses to machine physical addresses.
 13. The system of claim 1: wherein the graphics memory comprises shared graphics memory; and wherein the hypervisor to configure the lookup table within the graphics memory of the graphics processor unit comprises the hypervisor to write the lookup table into the shared graphics memory of the graphics processor unit.
 14. The system of claim 1: wherein the hypervisor to configure the lookup table within the graphics memory of the graphics processor unit comprises the hypervisor to instruct the graphics processor unit to retrieve and store the lookup table; wherein the graphics processor unit to issue a Direct Memory Access (DMA) request to fetch the lookup table; and wherein the graphics processor unit to store the lookup table in the graphics memory.
 15. The system of claim 1: wherein the hypervisor to issue a lookup table update to the graphics processor unit; and wherein the graphics processor unit to responsively update the lookup table in the graphics memory.
 16. A method comprising: managing, via a hypervisor, one or more guest machines on a system; configuring a lookup table with a mapping of guest physical addresses for the one or more guest machines to corresponding machine physical addresses; transferring the lookup table to a graphics memory of a graphics processor unit; receiving an access request to a main memory of the system from one of the guest machines, wherein the access request specifies a guest physical address; engaging an address translation service internal to the graphics processor unit by passing the guest physical address to the graphics processor unit; and translating, via the address translation service of the graphics processor unit, the guest physical address to a corresponding machine physical address.
 17. The method of claim 16, wherein the graphics processor unit comprises one or more graphics virtualization registers; wherein the hypervisor configures the one or more graphics virtualization registers to inform the graphics processor unit that one of the guest machines have been assigned to the graphics processor unit by the hypervisor; and wherein the one or more graphics virtualization registers further identify where the lookup table is located for the assigned guest machine.
 18. The method of claim 17, further comprising: passing a Guest physical Frame Number (GFN) from the hypervisor to the graphics processor unit requesting translation of the GFN by the address translation service to a Machine Frame Number (MFN), wherein the GFN represents the guest physical address for one of the one or more guest machines and wherein the MFN represents the machine physical addresses within the main memory corresponding to the guest physical address specified by the access request.
 19. The method of claim 16, wherein translating the guest physical address to a corresponding machine physical address comprises the address translation service of the graphics processor unit performing the translating based on a Guest physical Frame Number (GFN) to Machine Frame Number (MFN) mapping from the lookup table in the graphics memory.
 20. The method of claim 19, wherein: the address translation service retrieves the GFN to MFN mapping through a Translation Lookaside Buffer (TLB) integrated within the graphics processor unit; and wherein the TLB provides an indexable cache on behalf of the address translation service of the graphics processor unit.